gh-121109: Fix performance of tarfile reading with "r|*" by TomiBelan · Pull Request #121296 · python/cpython

TomiBelan · 2024-07-02T20:53:25Z

This PR fixes #121109.

Using the test files and test script described in the issue:

filename	mode	time with PR
`test.tar.gz`	`r:*`	1.075s
`test.tar.gz`	`r|*`	0.812s
`test.tar.xz`	`r:*`	1.066s
`test.tar.xz`	`r|*`	1.053s
`test.tar.bz2`	`r:*`	0.913s
`test.tar.bz2`	`r|*`	0.896s

After this PR, tf.list() with r|* runs at about the same speed as with r:*, as expected, rather than orders of magnitude slower.
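For readers unfamiliar with the two modes, here is a minimal sketch of the difference in how an archive is opened. The archive contents and the helper names (`build_archive`, `member_names`) are made up for illustration and are not part of the PR; the only point is that `r:*` (seekable) and `r|*` (stream) should yield the same members.

```python
import io
import tarfile


def build_archive() -> bytes:
    """Create a small gzip-compressed tar archive in memory (illustrative data only)."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tf:
        for name in ("a.txt", "b.txt"):
            payload = name.encode() * 100
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return out.getvalue()


def member_names(archive: bytes, mode: str) -> list:
    """List member names using the given read mode ("r:*" or "r|*")."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode=mode) as tf:
        return [member.name for member in tf]


archive = build_archive()
names_seekable = member_names(archive, "r:*")  # random access, may seek
names_stream = member_names(archive, "r|*")    # forward-only stream of blocks
```

Stream mode is what you would use for input that cannot seek, such as a pipe or socket, which is why its performance matters independently of `r:*`.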

Issue: tarfile "r|*" (stream mode) is much slower than "r:*" #121109

ghost · 2024-07-02T20:53:27Z

All commit authors signed the Contributor License Agreement.

danifus · 2024-07-04T02:21:38Z

+        if len(t) > size:
+            raise ReadError("decompress() returned too much data")


Do you see any scenario where this would be triggered? Looking at the zlib, bz2 and lzma decompressor docs for max_length, it looks like this shouldn't occur?

I have no issue with it being here but just checking if I'm missing something :)

TomiBelan · 2024-07-04

Right - it can only happen if there is a bug in the zlib, bz2 or lzma decompressor. I haven't checked their C source, but their docs say it should not occur.
It's not a necessary check, but I figured I'd add it just in case.
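The documented behavior being relied on here can be checked with a short sketch. It is not the PR's code; the payload and the `limit` value are arbitrary choices for illustration. Each standard-library decompressor is asked for at most `limit` bytes and the remaining output stays buffered inside the decompressor (or in `unconsumed_tail` for zlib).

```python
import bz2
import lzma
import zlib

payload = b"x" * 1_000_000  # arbitrary, highly compressible test data
limit = 4096                # arbitrary cap on bytes returned per call

# zlib: max_length is the second argument; unprocessed input is kept
# in the unconsumed_tail attribute for the next call.
z = zlib.decompressobj()
z_out = z.decompress(zlib.compress(payload), limit)

# bz2 and lzma: max_length is a keyword argument; needs_input tells the
# caller whether more compressed input is required before the next call.
b = bz2.BZ2Decompressor()
b_out = b.decompress(bz2.compress(payload), max_length=limit)

xz = lzma.LZMADecompressor()
xz_out = xz.decompress(lzma.compress(payload), max_length=limit)

sizes = {"zlib": len(z_out), "bz2": len(b_out), "lzma": len(xz_out)}
```

If any of these ever returned more than `limit` bytes, that would be the decompressor bug the check in the diff guards against.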


github-actions · 2026-04-17T06:37:01Z

This PR is stale because it has been open for 30 days with no activity.

TomiBelan · 2026-04-17T11:35:42Z

My dearest stale bot, I wish it was only 30 days! 😢

TomiBelan · 2026-05-23T21:30:59Z

Re @serhiy-storchaka #121109 (comment)

@TomiBelan, could you please test how your change affects the case of reading files byte-by-byte or by small chunks.

It's not needed. Small reads are already well-exercised by test_tarfile.py, especially StreamReadTest. When I add a print() to _Stream._read(), it shows a variety of size values during the test, e.g. 0, 1, 512, 4096, 7011, 10239, 10240.

This becomes clearer when you realize _Stream is just the outer shell, and the tar format parser itself also needs to read small chunks sometimes.

But all right. I also tested it with this script, which succeeded.

rm -rf data; mkdir data; for i in 1 2 3; do head -c1M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test1M.tar.gz data ; tar caf test1M.tar.xz data ; tar caf test1M.tar.bz2 data ; tar caf test1M.tar.zst data
rm -rf data; mkdir data; for i in 1 2 3; do head -c100M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test100M.tar.gz data ; tar caf test100M.tar.xz data ; tar caf test100M.tar.bz2 data ; tar caf test100M.tar.zst data

import sys
import tarfile
for filename in ('test1M.tar.gz', 'test1M.tar.xz', 'test1M.tar.bz2', 'test1M.tar.zst'):
    for mode in ('r|*', 'r:*'):
        for chunk_size in (1, 10000, 500000):
            print('running:', filename, mode, chunk_size, file=sys.stderr)
            with tarfile.open(filename, mode) as tf:
                for tarinfo in tf:
                    if tarinfo.isreg():
                        with tf.extractfile(tarinfo) as extractf:
                            total = 0
                            while True:
                                buf = extractf.read(chunk_size)
                                if not buf: break
                                total += len(buf)
                                assert buf == b'x' * len(buf)
                                assert len(buf) == chunk_size or total == tarinfo.size

Full disclosure: this script does what you asked for, but it actually isn't a very good test. extractfile() returns an io.BufferedReader, so the 1 byte read and the 10000 byte read both become 131072 byte reads.
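The buffering effect can be seen in isolation with a small sketch. `CountingRaw` is a hypothetical helper, not something from tarfile, and the 8192-byte buffer is chosen only to keep the example small (the 131072 figure above corresponds to the default buffer size on the machine used).

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that records the size of every read request."""

    def __init__(self, data: bytes):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.requests = []

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # len(buffer) is how many bytes the caller asked the raw layer for.
        self.requests.append(len(buffer))
        return self._inner.readinto(buffer)


raw = CountingRaw(b"x" * 16384)
buffered = io.BufferedReader(raw, buffer_size=8192)

# Read the whole stream one byte at a time through the buffered layer.
total = 0
while True:
    chunk = buffered.read(1)
    if not chunk:
        break
    total += len(chunk)
```

After the loop, `raw.requests` holds only a handful of large requests even though `read(1)` was called thousands of times, which is why reading through extractfile() does not exercise small reads in the underlying stream.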

And a benchmark:

import sys
import time
import tarfile
for filename in ('test100M.tar.gz', 'test100M.tar.xz', 'test100M.tar.bz2', 'test100M.tar.zst'):
    for mode in ('r|*', 'r:*'):
        print('running:', filename, mode, file=sys.stderr)
        start = time.time()
        with tarfile.open(filename, mode) as tf:
            tf.list()
        print('took', time.time() - start, file=sys.stderr)

In loop order (gz, xz, bz2, zst, each with r|* and then r:*), I got 1.3, 1.2, 1.9, 1.5, 1.1, 1.1, 0.2, 0.2 seconds. (This is a different machine from last time.)

TomiBelan · 2026-05-23T21:52:38Z

I made some changes:

Rebased to main.
Updated the zstd case, which didn't exist when this PR was created. (It has been waiting for review for almost 2 years...)
I rewrote the PR because I find my original patch hard to understand. 😳 This new version completely separates the gzip and non-gzip case. It's longer overall, and some bits are duplicated, but "explicit is better than implicit" - I hope it's clearer and easier to review.
HOWEVER: If a reviewer prefers the old patch (24110fb), I'd be happy to revert e61bccf.

serhiy-storchaka

Thank you for your benchmarks, @TomiBelan. Could you please also test the case of listing a tarfile containing a large number of files? Maybe in that case we can exercise small reads?

Please also test the case of random incompressible data. This is far from the common case, but we need to look at corner cases too.
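One possible way to build such a test case is sketched below. The helper names, file count, and file size are assumptions chosen for illustration; `os.urandom` stands in for incompressible data, and `getnames()` is used instead of `list()` only to avoid printing.

```python
import io
import os
import tarfile
import tempfile
import time


def make_archive(path: str, n_files: int, file_size: int) -> None:
    """Write a .tar.gz with many small members filled with random bytes."""
    with tarfile.open(path, "w:gz") as tf:
        for i in range(n_files):
            data = os.urandom(file_size)  # effectively incompressible
            info = tarfile.TarInfo(f"data/{i:06d}.bin")
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))


def time_listing(path: str, mode: str):
    """Return (member count, elapsed seconds) for walking all headers."""
    start = time.perf_counter()
    with tarfile.open(path, mode) as tf:
        count = len(tf.getnames())
    return count, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "many_small.tar.gz")
    make_archive(archive_path, n_files=200, file_size=1000)
    results = {mode: time_listing(archive_path, mode) for mode in ("r|*", "r:*")}
```

For a meaningful benchmark the counts would need to be much larger than the 200 files used here.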

Your code looks correct. Note that for the zlib decompressor we still have the issue of creating new bytes objects for unconsumed_tail, especially when reading in small chunks. It is smaller than the original issue, because the size of unconsumed_tail is limited by bufsize, while dbuf in the old code could be much larger. We can handle this in a separate issue if it is worth doing.

Another solution for this issue could be keeping the position in dbuf instead of modifying dbuf. The advantage of your solution is that it also limits the amount of memory consumed (dbuf can be very large). We could mitigate this by combining the two approaches and using decompress(cbuf, max(bufsize, size - c)). But this can make the code more complicated, and I am not sure that it will be faster.
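To make the "keep a position in dbuf" idea concrete, here is a rough sketch that contrasts it with re-slicing the buffer on every read. Both classes are hypothetical illustrations, not tarfile code, and they ignore decompression entirely; they only show the buffer handling.

```python
class SlicingBuffer:
    """Drops consumed data by re-slicing, copying the remainder on each read."""

    def __init__(self, data: bytes):
        self.dbuf = data

    def read(self, size: int) -> bytes:
        chunk = self.dbuf[:size]
        self.dbuf = self.dbuf[size:]  # copies everything that is left
        return chunk


class OffsetBuffer:
    """Keeps the buffer intact and advances a read position instead."""

    def __init__(self, data: bytes):
        self.dbuf = data
        self.pos = 0

    def read(self, size: int) -> bytes:
        chunk = self.dbuf[self.pos:self.pos + size]  # copies only the chunk
        self.pos += len(chunk)
        return chunk


def drain(buffer, size: int) -> list:
    """Read fixed-size chunks until the buffer is exhausted."""
    chunks = []
    while True:
        chunk = buffer.read(size)
        if not chunk:
            return chunks
        chunks.append(chunk)


data = bytes(range(100))
sliced_chunks = drain(SlicingBuffer(data), 7)
offset_chunks = drain(OffsetBuffer(data), 7)
```

The two produce the same chunks; the difference is that the slicing version does work proportional to the remaining buffer on each small read, while the offset version keeps the whole buffer alive in memory, which is the trade-off described above.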

You can experiment with this or leave it for another issue. I'll approve this PR once I see the new benchmark results, if they are satisfactory.

serhiy-storchaka · 2026-05-28T13:02:42Z

+        if len(t) > size:
+            raise ReadError("decompress() returned too much data")


We already make assumptions about the decompressors above: that they cannot make the loops infinite (when unconsumed_tail is not empty or needs_input is false, but decompress() returns empty bytes), and that decompress() does not produce data in tiny chunks, making the code inefficient. If we trust them there, we should trust them here. This check is only a distraction. I suggest removing it or replacing it with an assert.

TomiBelan requested a review from ethanfurman as a code owner July 2, 2024 20:53

bedevere-app Bot mentioned this pull request Jul 2, 2024

tarfile "r|*" (stream mode) is much slower than "r:*" #121109

Open

bedevere-app Bot added the awaiting review label Jul 2, 2024

danifus approved these changes Jul 4, 2024

bedevere-app Bot added awaiting core review and removed awaiting review labels Jul 4, 2024

github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 17, 2026

github-actions Bot removed the stale Stale PR or inactive for long period of time. label May 13, 2026

TomiBelan added 2 commits May 23, 2026 21:11

Fix performance of tarfile reading with "r|*" (24110fb)
Merge zstd additions (42a7a3d)
Refactor by splitting gzip and non-gzip branch (e61bccf)

TomiBelan force-pushed the slowtar branch from 5aa9c08 to e61bccf Compare May 23, 2026 21:35

serhiy-storchaka reviewed May 28, 2026

TomiBelan wants to merge 3 commits into python:main from TomiBelan:slowtar
